Abstract


We present a framework for clustering with cluster-specific feature selection. The
framework, CRAFT, is derived from asymptotic log posterior formulations of nonparametric
MAP-based clustering models. CRAFT handles assorted data, i.e.,
both numeric and categorical data, and the underlying objective functions are
intuitively appealing. The resulting algorithm is simple to implement and scales
nicely, requires minimal parameter tuning, obviates the need to specify the number
of clusters a priori, and compares favorably with other methods on real datasets.


1 Introduction 


We present a principled framework for clustering with feature selection. Feature selection can be 
global (where the same features are used across clusters) or local (cluster-specific). For most real 
applications, feature selection ideally should be cluster-specific, e.g., when clustering news articles, 
the similarity between articles about politics should be assessed based on the language about politics, 
regardless of their references to other topics such as sports. However, choosing cluster-specific 
features in an unsupervised way can be challenging. In fact, unsupervised global feature selection 
is widely considered a hard problem. Cluster-specific unsupervised feature selection is even
harder since separate, possibly overlapping, subspaces need to be inferred. Our proposed method, 
called CRAFT (ClusteR-specific Assorted Feature selecTion), has a prior parameter that can be 
adjusted for a desired balance between global and local feature selection. 

CRAFT addresses another challenge for clustering: handling assorted data, containing both numeric
and categorical features. The vast majority of clustering methods, like K-means, were designed
for numeric data. However, most real datasets contain categorical variables or are processed
to contain categorical variables; for instance, in web-based clustering applications, it is standard
to represent each webpage as a binary (categorical) feature vector. Treating categorical data as if it
were real-valued does not generally work since it ignores ordinal relationships among the categorical
labels. This explains why, despite several attempts, variations of K-means
have largely proved ineffective in handling mixed data.






The derivations of CRAFT's algorithms follow from asymptotics on the log posterior of its generative
model. The model is based on Dirichlet process mixtures (see Kim et al.
for a prototype model with feature selection), and thus the number of clusters can be chosen
nonparametrically by the algorithm. Our asymptotic calculations were inspired by the works of Kulis
and Jordan, who derived the DP-means objective by considering approximations to the
log-likelihood, and Broderick et al., who instead approximated the posterior log likelihood to derive
other nonparametric variations of K-means. These works do not consider feature selection, and as a
result, our generative model is entirely different, and the calculations differ considerably from
previous works. However, when the data are only numeric, we recover the DP-means objective with an
additional term arising due to feature selection. CRAFT's asymptotics yield interpretable objective
functions, and suggest K-means-style algorithms that recovered subspaces on synthetic data, and
outperformed several state-of-the-art benchmarks on real datasets in our experiments.


2 The CRAFT Framework 

The main intuition behind our formalism is that the points in a cluster should agree closely on the 
features selected for that cluster. As it turns out, the objective is closely related to the cluster’s 
entropy for discrete data and variance for numeric data. For instance, consider a parametric setting 
where the features are all binary categorical, taking values only in {0,1}, and we select all the 
features. Assume that the features are drawn from independent Bernoulli distributions. Let the 
cluster assignment vector be $z$, i.e., $z_{n,k} = 1$ if point $x_n$ is assigned to cluster $k$. Then, we obtain
the following objective using a straightforward maximum likelihood estimation (MLE) procedure:

$$\arg\min_{z} \sum_{k} \sum_{n: z_{n,k}=1} \sum_{d} \mathbb{H}(\mu^*_{kd}),$$

where $\mu^*_{kd}$ denotes the mean of feature $d$ computed by using points belonging to cluster $k$, and the
entropy function $\mathbb{H}(p) = -p \log p - (1-p) \log(1-p)$ for $p \in [0,1]$ characterizes the uncertainty.
Thus the objective tries to minimize the overall uncertainty across clusters and thus forces similar 
points to be close together, which makes sense from a clustering perspective. 
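The entropy objective above can be evaluated directly for a candidate assignment. The following is a minimal sketch in Python, assuming binary features stored as lists of 0/1 values; the function names `bernoulli_entropy` and `entropy_objective` are illustrative and not part of the original work.

```python
import math

def bernoulli_entropy(p):
    """H(p) = -p log p - (1 - p) log(1 - p), with H(0) = H(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def entropy_objective(points, assignments):
    """Sum over clusters k, points n in k, and features d of H(mu_kd).

    points: list of equal-length lists with entries in {0, 1}.
    assignments: list of cluster labels, one per point.
    """
    clusters = {}
    for x, k in zip(points, assignments):
        clusters.setdefault(k, []).append(x)
    total = 0.0
    for members in clusters.values():
        size = len(members)
        for d in range(len(members[0])):
            mu = sum(x[d] for x in members) / size  # per-cluster feature mean
            # H(mu_kd) does not depend on n, so the sum over n is a factor `size`.
            total += size * bernoulli_entropy(mu)
    return total
```

On a toy dataset with two perfectly separated groups, the correct assignment gives an objective of zero, while an assignment that mixes the groups is penalized.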

It is not immediately clear how to extend this insight about clustering to cluster-specific feature
selection. CRAFT combines assorted data by enforcing a common Bernoulli prior that selects features,
regardless of whether they are categorical or numerical. We derive an asymptotic approximation for
the posterior joint log likelihood of the observed data, cluster indicators, cluster means, and feature
means. Modeling assumptions are then made for categorical and numerical data separately; this is
why CRAFT can handle multiple data types. Unlike generic procedures, such as Variational Bayes,
that are typically computationally intensive, the CRAFT asymptotics lead to elegant K-means style
algorithms that repeat the following steps: (a) compute the "distances" to the cluster centers
using the selected features for each cluster, (b) choose which cluster each point should be assigned to (and
create new clusters if needed), and (c) recompute the centers and select the appropriate cluster-specific
features for the next iteration.

Formally, the data $x$ consists of i.i.d. $D$-dimensional binary vectors $x_1, x_2, \ldots, x_N$. We assume
a Dirichlet process (DP) mixture model to avoid having to specify a priori the number of clusters
$K^+$, and use the hyper-parameter $\theta$, in the underlying exchangeable partition probability function
(EPPF), to tune the probability of starting a new cluster. We use $z$ to denote cluster indicators:
$z_{n,k} = 1$ if $x_n$ is assigned to cluster $k$. Since $K^+$ depends on $z$, we will often make the connection
explicit by writing $K^+(z)$. Let Cat and Num denote the set of categorical features and the set
of numeric features respectively.

The variables $v_{kd} \in \{0,1\}$ indicate whether feature $d \in [D]$ is selected in cluster $k \in [K]$. We
assume $v_{kd}$ is generated from a Bernoulli distribution with parameter $\nu_{kd}$. Further, we assume $\nu_{kd}$
is generated from a Beta prior having variance $\rho$ and mean $m$.

For categorical features, the features $d$ selected in any cluster $k$ have values drawn from a discrete
distribution with parameters $\eta_{kdt}$, $d \in$ Cat, where $t \in T_d$ indexes the different values taken by
the categorical feature $d$. The parameters $\eta_{kdt}$ are drawn from a Beta distribution with parameters
$\alpha_{kdt}/K^+$ and 1. On the other hand, we assume the values for features that have not been selected
are drawn from a discrete distribution with cluster-independent mean parameters $\eta_{0dt}$.
For numeric features, we formalize the intuition that the features selected to represent clusters should
exhibit small variance relative to unselected features by assuming a conditional density of the form:

$$f(x_{nd} \mid v_{kd}) = \frac{1}{Z_{kd}} \exp\left\{-\left[ v_{kd} \frac{(x_{nd} - \zeta_{kd})^2}{2\sigma_{kd}^2} + (1 - v_{kd}) \frac{(x_{nd} - \zeta_d)^2}{2\sigma_d^2} \right]\right\}, \qquad Z_{kd} = \frac{\sqrt{2\pi}\,\sigma_d\,\sigma_{kd}}{\sigma_{kd}\sqrt{1 - v_{kd}} + \sigma_d\sqrt{v_{kd}}},$$

where $x_{nd} \in \mathbb{R}$, $v_{kd} \in \{0,1\}$, $Z_{kd}$ ensures $f$ integrates to 1, and $\sigma_{kd}$ guides the allowed
variance of a selected feature $d$ over points in cluster $k$ by asserting that feature $d$ concentrates around its
cluster mean $\zeta_{kd}$. The features not selected are assumed to be drawn from Gaussian distributions
that have cluster-independent means $\zeta_d$ and variances $\sigma_d^2$. Fig. 1 shows the graphical model.
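As a sanity check on the numeric density, the sketch below evaluates $f$ and its normalizer for $v_{kd} \in \{0,1\}$ and can be used to confirm numerically that the density integrates to 1 in both cases. The function and argument names are hypothetical; `sigma_k` and `zeta_k` stand for the cluster-specific quantities, `sigma_g` and `zeta_g` for the cluster-independent (global) ones.

```python
import math

def normalizer(v, sigma_k, sigma_g):
    """Z_kd for v in {0, 1}: sqrt(2*pi)*sigma_k if selected, sqrt(2*pi)*sigma_g otherwise."""
    return (math.sqrt(2.0 * math.pi) * sigma_g * sigma_k
            / (sigma_k * math.sqrt(1 - v) + sigma_g * math.sqrt(v)))

def density(x, v, zeta_k, sigma_k, zeta_g, sigma_g):
    """Conditional density f(x | v) of a numeric feature value x."""
    exponent = (v * (x - zeta_k) ** 2 / (2.0 * sigma_k ** 2)
                + (1 - v) * (x - zeta_g) ** 2 / (2.0 * sigma_g ** 2))
    return math.exp(-exponent) / normalizer(v, sigma_k, sigma_g)
```

With $v_{kd} = 1$ the expression reduces to a Gaussian centered at the cluster mean, and with $v_{kd} = 0$ to a Gaussian centered at the global mean.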


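Let $\mathbb{I}(\mathcal{P})$ be 1 if the predicate $\mathcal{P}$ is true, and 0 otherwise. Under asymptotic conditions, minimizing
the joint negative log-likelihood yields the following objective (see the Supplementary for details):

$$
\begin{aligned}
\arg\min_{z, v, \eta, \zeta, \sigma}\;
& \underbrace{\sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d \in \mathrm{Num}} \frac{v_{kd}(x_{nd} - \zeta_{kd})^2}{2\sigma_{kd}^2}}_{\text{Numeric Data Discrepancy}}
+ \underbrace{(\lambda + D F_0) K^+}_{\text{Regularization Term}}
+ \underbrace{\left(\sum_{k=1}^{K^+} \sum_{d=1}^{D} v_{kd}\right) F_\Delta}_{\text{Feature Control}} \\
& + \underbrace{\sum_{k=1}^{K^+} \sum_{d \in \mathrm{Cat}} \left[ v_{kd} \sum_{n: z_{n,k}=1} \sum_{t \in T_d} -\mathbb{I}(x_{nd} = t) \log \eta_{kdt} + (1 - v_{kd}) \sum_{n: z_{n,k}=1} \sum_{t \in T_d} -\mathbb{I}(x_{nd} = t) \log \eta_{0dt} \right]}_{\text{Categorical Data Discrepancy}}
\end{aligned}
\tag{1}
$$

where $F_\Delta$ and $F_0$ depend only on the $(m, \rho)$ pair: $F_\Delta = F_1 - F_0$, with

$$
\begin{aligned}
F_0 &= (a_0 + b_0) \log(a_0 + b_0) - a_0 \log a_0 - b_0 \log b_0, \\
F_1 &= (a_1 + b_1) \log(a_1 + b_1) - a_1 \log a_1 - b_1 \log b_1, \\
a_0 &= \frac{m^2(1-m)}{\rho} - m, \quad b_0 = \frac{m(1-m)^2}{\rho} + m, \quad a_1 = a_0 + 1, \quad \text{and} \quad b_1 = b_0 - 1.
\end{aligned}
$$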
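The constants $F_0$, $F_1$, and $F_\Delta$ are cheap to compute once $m$ and $\rho$ are fixed. The sketch below is one way to do so under the definitions above; the function name `feature_control_terms` is hypothetical, and the range check on $\rho$ reflects the constraint $\rho \in (0, m(1-m))$ stated for Algorithm 1.

```python
import math

def feature_control_terms(m, rho):
    """Return (F_0, F_1, F_Delta) for a Beta prior with mean m and variance rho."""
    if not (0.0 < m < 1.0 and 0.0 < rho < m * (1.0 - m)):
        raise ValueError("need 0 < m < 1 and 0 < rho < m(1 - m)")
    a0 = m * m * (1.0 - m) / rho - m
    b0 = m * (1.0 - m) ** 2 / rho + m
    a1, b1 = a0 + 1.0, b0 - 1.0

    def big_f(a, b):
        # F = (a + b) log(a + b) - a log a - b log b
        return (a + b) * math.log(a + b) - a * math.log(a) - b * math.log(b)

    f0, f1 = big_f(a0, b0), big_f(a1, b1)
    return f0, f1, f1 - f0
```

In this sketch, $F_\Delta$ is positive for small $m$ (selecting a feature is penalized), zero at $m = 0.5$, and negative for large $m$, which matches the role of $m$ as the expected fraction of selected features.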
This objective has an elegant interpretation. The categorical and numerical discrepancy terms show
how selected features (with $v_{kd} = 1$) are treated differently than unselected features. The
regularization term controls the number of clusters, and modulates the effect of feature selection. The feature
control term contains the adjustable parameters: $m$ controls the number of features that would be
turned on per cluster, whereas $\rho$ guides the extent of cluster-specific feature selection. A detailed
derivation is provided in the Supplementary.

A K-means style alternating minimization procedure for clustering assorted data, along with feature
selection, is outlined in Algorithm 1. The algorithm repeats the following steps until convergence:
(a) compute the "distances" to the cluster centers using the selected features for each cluster, (b)
choose which cluster each point should be assigned to (and create new clusters if needed), and (c)
recompute the cluster centers and select the appropriate features for each cluster using the criteria
that follow directly from the model objective and variance asymptotics. In particular, the algorithm
starts a new cluster if the cost of assigning a point to its closest cluster center exceeds $(\lambda + D F_0)$,
the cost it would incur to initialize an additional cluster. The information available from the already 
selected features is leveraged to guide the initial selection of features in the new cluster. Finally, the 
updates on cluster means and feature selection are performed at the end of each iteration. 

Approximate Budget Setting for a Variable Number of Features: Algorithm 1 selects a fraction
$m$ of features per cluster, uniformly across clusters. A slight modification would allow Algorithm
1 to have a variable number of features across clusters, as follows: specify a tuning parameter
$\epsilon_c \in (0,1)$ and choose all the features $d$ in cluster $k$ for which $G_d - G_{kd} > \epsilon_c G_d$. Likewise for
numeric features, we may simply choose features that have variance less than some positive constant
$\epsilon_v$. As we show later, this slightly modified algorithm recovers the exact subspace on synthetic data
in the approximate budget setting for a wide range of $m$.
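The thresholding rule of the approximate budget setting can be sketched as follows. This is an illustrative Python fragment, assuming the per-feature quantities $G_d$, $G_{kd}$ and the per-cluster variances have already been computed and are passed in as dictionaries keyed by feature index; the function names are hypothetical.

```python
def select_categorical(g_global, g_cluster, eps_c):
    """Keep categorical feature d when G_d - G_kd > eps_c * G_d."""
    return [d for d in g_global
            if g_global[d] - g_cluster[d] > eps_c * g_global[d]]

def select_numeric(variances, eps_v):
    """Keep numeric feature d when its within-cluster variance is below eps_v."""
    return [d for d, var in variances.items() if var < eps_v]
```

For instance, with $G_d = 10$ for two features and $G_{kd}$ equal to 1 and 9 respectively, a threshold $\epsilon_c = 0.8$ keeps only the first feature.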


3 Discussion 

We discuss special cases and extensions below, which have implications for future work. 
Algorithm 1 CRAFT

Input: $x_1, \ldots, x_N$: $D$-dimensional input data with categorical features Cat and numeric features
Num, $\lambda > 0$: regularization parameter, $m \in (0,1)$: fraction of features per cluster, and
(optional) $\rho \in (0, m(1-m))$: control parameter that guides global/local feature selection. Each
feature $d \in$ Cat takes values from the set $T_d$, while each feature $d \in$ Num takes values from $\mathbb{R}$.

Output: $K^+$: number of clusters, $l_1, \ldots, l_{K^+}$: clustering, and $v_1, \ldots, v_{K^+}$: selected features.

1. Initialize $K^+ = 1$, $l_1 = \{x_1, \ldots, x_N\}$, cluster center (sample randomly) with
categorical mean $\eta_1$ and numeric mean $\zeta_1$, and draw $v_1 \sim [\text{Bernoulli}(m)]^D$. If $\rho$ is not
specified as an input, initialize $\rho = \max\{0.01, m(1-m) - 0.01\}$. Compute the global
categorical mean $\eta_0$. Initialize the cluster indicators $z_n = 1$ for all $n \in [N]$, and $\sigma_{1d} = 1$
for all $d \in$ Num.

2. Compute $F_\Delta$ and $F_0$ using (1).

3. Repeat until cluster assignments do not change

• For each point $x_n$

- Compute $\forall k \in [K^+]$

$$
\begin{aligned}
d_{nk} ={}& \sum_{d \in \mathrm{Num}} \frac{v_{kd}(x_{nd} - \zeta_{kd})^2}{2\sigma_{kd}^2} + \sum_{d \in \mathrm{Cat}: v_{kd}=0} \sum_{t \in T_d} -\mathbb{I}(x_{nd} = t) \log \eta_{0dt} \\
& + \left(\sum_{d=1}^{D} v_{kd}\right) F_\Delta + \sum_{d \in \mathrm{Cat}: v_{kd}=1} \sum_{t \in T_d} -\mathbb{I}(x_{nd} = t) \log \eta_{kdt}.
\end{aligned}
$$

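- If $\min_k d_{nk} > (\lambda + D F_0)$, set $K^+ = K^+ + 1$, $z_n = K^+$, and draw

$$v_{K^+ d} \sim \text{Bernoulli}\left( \frac{\sum_{j=1}^{K^+ - 1} a_{v_{jd}}}{\sum_{j=1}^{K^+ - 1} \left(a_{v_{jd}} + b_{v_{jd}}\right)} \right) \quad \forall d \in [D],$$

where $a$ and $b$ are as defined in (1). Set $\eta_{K^+}$ and $\zeta_{K^+}$ using $x_n$. Set
$\sigma_{K^+ d} = 1$ for all $d \in$ Num.

- Otherwise, set $z_n = \arg\min_k d_{nk}$.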

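• Generate clusters $l_1, \ldots, l_{K^+}$ based on $z_1, \ldots, z_N$: $l_k = \{x_n \mid z_n = k\}$.

• Update the means $\eta$ and $\zeta$, and variances $\sigma^2$, for all clusters.

• For each cluster $l_k$, $k \in [K^+]$, update $v_k$: choose the $m|\mathrm{Num}|$ numeric features
$d'$ with lowest $\sigma_{kd'}$ in $l_k$, and choose $m|\mathrm{Cat}|$ categorical features $d$ with
maximum value of $G_d - G_{kd}$, where $G_d = -\sum_{n: z_{n,k}=1} \sum_{t \in T_d} \mathbb{I}(x_{nd} = t) \log \eta_{0dt}$
and $G_{kd} = -\sum_{n: z_{n,k}=1} \sum_{t \in T_d} \mathbb{I}(x_{nd} = t) \log \eta_{kdt}$.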
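The assignment step of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative fragment, not the authors' implementation: the data layout (a point as a list indexed by feature, categorical probabilities as nested dictionaries, numeric means and standard deviations as dictionaries keyed by feature index) and the names `point_cost` and `assign` are assumptions made for the example. All categorical probabilities are assumed to be strictly positive (e.g., smoothed) so that the logarithm is defined.

```python
import math

def point_cost(x, cat_idx, num_idx, v_k, eta_k, eta_0, zeta_k, sigma_k, f_delta):
    """Cost d_nk of assigning point x to cluster k.

    v_k[d] in {0, 1} marks whether feature d is selected in cluster k.
    eta_k[d][t], eta_0[d][t]: cluster and global probability of value t for feature d.
    """
    cost = sum(v_k) * f_delta  # feature control contribution
    for d in cat_idx:
        probs = eta_k[d] if v_k[d] == 1 else eta_0[d]
        cost += -math.log(probs[x[d]])
    for d in num_idx:
        if v_k[d] == 1:  # unselected numeric features do not contribute
            cost += (x[d] - zeta_k[d]) ** 2 / (2.0 * sigma_k[d] ** 2)
    return cost

def assign(costs, lam, num_features, f0):
    """Return the index of the cheapest cluster, or None to signal a new cluster."""
    best = min(range(len(costs)), key=costs.__getitem__)
    if costs[best] > lam + num_features * f0:
        return None
    return best
```

Returning `None` here simply signals that the caller should open a new cluster for the point, as in the first branch of the inner loop.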


Recovering DP-means objective on Numeric Data 

CRAFT recovers the DP-means objective in a degenerate setting (see Supplementary):

$$\arg\min_{z} \sum_{k=1}^{K^+(z)} \sum_{n: z_{n,k}=1} \sum_{d} (x_{nd} - \zeta^*_{kd})^2 + \lambda K^+(z), \tag{2}$$

where $\zeta^*_{kd}$ denotes the (numeric) mean of feature $d$ computed by using points belonging to cluster $k$.
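For reference, the DP-means objective in (2) can be evaluated for a given assignment with the short sketch below. The function name `dp_means_objective` and the list-of-lists data layout are assumptions made for illustration.

```python
def dp_means_objective(points, assignments, lam):
    """Sum of squared distances to cluster means plus lam times the number of clusters."""
    clusters = {}
    for x, k in zip(points, assignments):
        clusters.setdefault(k, []).append(x)
    cost = 0.0
    for members in clusters.values():
        dims = len(members[0])
        means = [sum(x[d] for x in members) / len(members) for d in range(dims)]
        cost += sum((x[d] - means[d]) ** 2 for x in members for d in range(dims))
    return cost + lam * len(clusters)
```

The penalty `lam` trades off within-cluster scatter against the number of clusters: on three one-dimensional points 0, 2, and 10 with `lam = 3`, splitting off the outlier costs 8, whereas a single cluster costs 59.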

Unifying Global and Local Feature Selection 

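An important aspect of CRAFT is that the point estimate of $\nu_{kd}$ is

$$\frac{a_{kd}}{a_{kd} + b_{kd}} = \frac{\left(\frac{m^2(1-m)}{\rho} - m\right) + v_{kd}}{\frac{m(1-m)}{\rho}} = m + \frac{(v_{kd} - m)\rho}{m(1-m)} \to \begin{cases} v_{kd}, & \text{as } \rho \to m(1-m), \\ m, & \text{as } \rho \to 0. \end{cases}$$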
Figure 1: CRAFT graphical model. For cluster-specific feature selection $\rho$ is set to a high value
determined by $m$, whereas for global feature selection $\rho$ is set close to 0. The dashed arrow
emphasizes this important part of our formalism that unifies cluster-specific and global feature selection.


Thus, using a single parameter $\rho$, we can interpolate between cluster-specific selection, $\rho \to m(1-m)$,
and global selection, $\rho \to 0$. Since we are often interested only in one of these two extreme
cases, this also implies that we essentially need to specify only $m$, which is often determined by
application requirements. Thus, CRAFT requires minimal tuning for most practical purposes.
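The interpolation between local and global selection can be checked numerically. The sketch below computes the point estimate of $\nu_{kd}$ from the posterior shape parameters $a_{kd} = \frac{m^2(1-m)}{\rho} - m + v_{kd}$ and $b_{kd} = \frac{m(1-m)^2}{\rho} + m - v_{kd}$ given in the Supplementary; the function name `nu_point_estimate` is hypothetical.

```python
def nu_point_estimate(v, m, rho):
    """Posterior-mean estimate a_kd / (a_kd + b_kd) of nu_kd for v in {0, 1}."""
    a = m * m * (1.0 - m) / rho - m + v
    b = m * (1.0 - m) ** 2 / rho + m - v
    return a / (a + b)
```

As $\rho$ approaches $m(1-m)$ the estimate follows the indicator $v_{kd}$, and as $\rho$ approaches 0 it collapses to the prior mean $m$.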

Accommodating Statistical-Computational Trade-offs 

We can extend the basic CRAFT model of Fig. 1 to have cluster-specific means $m_k$, which may
in turn be modulated via Beta priors. The model can also be readily extended to incorporate more
informative priors or allow overlapping clusters, e.g., we can do away with the independent
distribution assumptions for numeric data, by introducing covariances and taking a suitable prior like the
inverse Wishart. The parameters $\alpha$ and $\sigma_d$ do not appear in the CRAFT objective since they vanish
due to the asymptotics and the appropriate setting of the hyperparameter $\theta$. Retaining some of these
parameters, in the absence of asymptotics, will lead to additional terms in the objective, thereby
requiring more computational effort. Depending on the available computational resources, one might
also like to achieve feature selection with the exact posterior instead of a point estimate. CRAFT's
basic framework can gracefully accommodate all such statistical-computational trade-offs.

4 Experimental Results 

We first provide empirical evidence on synthetic data about CRAFT’s ability to recover the feature 
subspaces. We then show how CRAFT outperforms an enhanced version of DP-means that includes 
feature selection on a real binary dataset. This experiment underscores the significance of having 
different measures for categorical data and numeric data. Finally, we compare CRAFT with other 
recently proposed feature selection methods on real world benchmarks. In what follows, the fixed 
budget setting is where the number of features selected per cluster is constant, and the approximate 
budget setting is where the number of features selected per cluster varies over the clusters. We set 
$\rho = m(1-m) - 0.01$ in all our experiments to facilitate cluster-specific feature selection.

Exact Subspace Recovery on Synthetic Data 

We now show the results of our experiments on synthetic data, in both the fixed and the approximate 
budget settings, that suggest CRAFT has the ability to recover subspaces on both categorical 
and numeric data, amidst noise, under different scenarios: (a) disjoint subspaces, (b) overlapping 
subspaces including the extreme case of containment of a subspace wholly within the other, (c) 
extraneous features, and (d) non-uniform distribution of examples and features across clusters. 
Figure 2: (Fixed budget) CRAFT recovered the subspaces on categorical datasets.

Figure 3: (Fixed budget) CRAFT recovered the subspaces on numeric datasets.


Fixed Budget Setting: Fig. 2(a) shows a binary dataset comprising 300 24-feature points,
evenly split between 3 clusters that have disjoint subspaces of 8 features each. We sampled the
remaining features independently from a Bernoulli distribution with parameter 0.1. Fig. 2(b) shows
that CRAFT recovered the subspaces with $m = 1/3$, as we would expect. In Fig. 2(c) we modified
the dataset to have (a) an unequal number of examples across the different clusters, (b) a fragmented
feature space each for clusters 1 and 3, (c) a completely noisy feature, and (d) an overlap between
second and third clusters. As shown in Fig. 2(d), CRAFT again identified the subspaces accurately.


Fig. 3(a) shows the second dataset comprising 300 36-feature points, evenly split across 3 clusters,
drawn from independent Gaussians having unit variance and means 1, 5 and 10 respectively. We
designed clusters to comprise features 1-12, 13-24, and 22-34 respectively so that the first two
clusters were disjoint, whereas the last two had some overlapping features. We added isotropic noise by
sampling the remaining features from a Gaussian distribution having mean 0 and standard deviation
3. Fig. 3(b) shows that CRAFT recovered the subspaces with $m = 1/3$. We then modified this
dataset in Fig. 3(c) to have cluster 2 span a non-contiguous feature subspace. Additionally, cluster
2 is designed such that one partition of its features overlaps partially with cluster 1, while the other
is subsumed completely within the subspace of cluster 3. Also, there are several extraneous features
not contained within any cluster. CRAFT recovers the subspaces on these data too (Fig. 3(d)).


Approximate Budget Setting: We now show that CRAFT may recover the subspaces even when 
we allow a different number of features to be selected across the different clusters. 

We modified the original categorical synthetic dataset to have cluster 3 (a) overlap with cluster 1,
and more importantly, (b) significantly overlap with cluster 2. We obtained the configuration, shown
in Fig. 4(a), by splitting cluster 3 (8 features) evenly in two parts, and increasing the number of
features in cluster 2 (16 features) considerably relative to cluster 1 (9 features), thereby making
the distribution of features across the clusters non-uniform. We observed, see Fig. 4(b), that for
$\epsilon_c \in [0.76, 1)$, the CRAFT algorithm for the approximate budget setting recovered the subspace
exactly for a wide range of $m$, more specifically for all values, when $m$ was varied in increments
of 0.1 from 0.2 to 0.9. This implies the procedure essentially requires tuning only $\epsilon_c$. We easily
found the appropriate range by searching in decrements of 0.01 starting from 1. Fig. 4(d) shows
the recovered subspaces for a similar set-up for the numeric data shown in Fig. 4(c). We observed
that for $\epsilon_v \in [4, 6]$, the recovery was robust to selection of $m \in [0.1, 0.9]$, similar to the case of
categorical data. For our purposes, we searched for $\epsilon_v$ in increments of 0.5 from 1 to 9, since the
global variance was set to 9. Thus, with minimal tuning, we recovered subspaces in all cases.
Figure 4: (Approximate budget) CRAFT recovered the subspaces on both the categorical dataset
shown in (a) and the numeric dataset shown in (c), and required minimal tuning.

Figure 5: Purity (a-c) and NMI (d-f) comparisons on Splice for different values of $m$. DP-RF is
DP-means(R) extended to incorporate feature selection.


Experimental Setup for Real Datasets 

In order to compare the non-parametric CRAFT algorithm with other methods (where the number
of clusters K is not defined in advance), we followed the farthest-first heuristic used by the authors
of DP-means, which is reminiscent of the seeding proposed in methods such as K-means++
and Hochbaum-Shmoys initialization: for an approximate number of desired clusters $k$, a
suitable $\lambda$ is found in the following manner. First a singleton set $T$ is initialized, and then iteratively
at each of the $k$ rounds, the point in the dataset that is farthest from $T$ is added to $T$. The distance of
a point $x$ from $T$ is taken to be the smallest distance between $x$ and any point in $T$, for evaluating the
corresponding objective function. At the end of the $k$ rounds, we set $\lambda$ as the distance of the last point
that was included in $T$. Thus, for both DP-means and CRAFT, we determined their respective $\lambda$ by
following the farthest first heuristic evaluated on their objectives: K-means objective for DP-means 
and entropy based objective for CRAFT. 
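The farthest-first heuristic described above is straightforward to sketch. The fragment below is illustrative only: the function name `farthest_first_lambda`, the `start` index used to seed $T$, and the user-supplied `dist` callable (squared Euclidean distance for DP-means, an entropy-based discrepancy for CRAFT) are assumptions made for the example.

```python
def farthest_first_lambda(points, k, dist, start=0):
    """Run k farthest-first rounds and return the distance of the last point added to T."""
    # Distance from each point to the current set T (initially the singleton {points[start]}).
    d_to_t = [dist(p, points[start]) for p in points]
    last = 0.0
    for _ in range(k):
        idx = max(range(len(points)), key=d_to_t.__getitem__)  # farthest point from T
        last = d_to_t[idx]
        # Adding points[idx] to T can only shrink each point's distance to T.
        d_to_t = [min(old, dist(p, points[idx])) for old, p in zip(d_to_t, points)]
    return last
```

For example, on the one-dimensional points 0, 1, 10, 11, 20 with squared distance and $k = 2$, the heuristic first adds 20 and then 10, so the returned value is 100.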

Kulis and Jordan initialized $T$ to the global mean for the DP-means algorithm. We instead chose
a point randomly from the input to initialize $T$ for CRAFT. In our experiments, we found that this
strategy is often more effective than using the global mean since the cluster centers tend to
be better separated and less constrained. However, to highlight that the poor performance of DP- 
means is not just an artifact of the initial cluster selection strategy but more importantly, it is due to 
the mismatch of the Euclidean distance to categorical data, we also conducted experiments on DP- 
means with random selection of the initial cluster center from the data points. We call this method 
DP-means(R) where R indicates randomness in selecting the initial center. 
Table 1: CRAFT versus DP-means and state-of-the-art feature selection methods when half of the
features were selected (i.e. m = 0.5). We abbreviate MCFS to M, NDFS to N, DP-means to D, and
DP-means(R) to DR to fit the table within margins. DP-means and DP-means(R) do not select any
features. The number of clusters was chosen to be same as the number of classes in each dataset.

Dataset   Average Purity                      Average NMI
          CRAFT  M     N     DR    D          CRAFT  M     N     DR    D
Bank      0.67   0.65  0.59  0.61  0.61       0.16   0.06  0.02  0.03  0.03
Spam      0.72   0.64  0.64  0.61  0.61       0.20   0.05  0.05  0.00  0.00
Splice    0.75   0.62  0.63  0.61  0.52       0.20   0.04  0.05  0.05  0.01
Wine      0.71   0.72  0.69  0.66  0.66       0.47   0.35  0.47  0.44  0.44
Monk      0.56   0.55  0.53  0.54  0.53       0.03   0.02  0.00  0.00  0.00


Table 2: CRAFT versus DP-means and state-of-the-art feature selection methods (m = 0.8). 


Dataset   Average Purity                      Average NMI
          CRAFT  M     N     DR    D          CRAFT  M     N     DR    D
Bank      0.64   0.61  0.61  0.61  0.61       0.08   0.03  0.03  0.03  0.03
Spam      0.72   0.64  0.64  0.61  0.61       0.23   0.05  0.05  0.00  0.00
Splice    0.74   0.68  0.63  0.61  0.52       0.18   0.09  0.05  0.05  0.01
Wine      0.82   0.73  0.69  0.66  0.66       0.54   0.42  0.42  0.44  0.44
Monk      0.57   0.54  0.54  0.54  0.53       0.03   0.00  0.00  0.00  0.00


Evaluation Criteria For Real Datasets 

To evaluate the quality of clustering, we use datasets with known true labels. We use two standard 
metrics, purity and normalized mutual information (NMI), to measure the clustering performance.
To compute purity, each full cluster is assigned to the class label that is most frequent in the
cluster. Purity is the proportion of examples that we assigned to the correct label. Normalized mutual 
information is the mutual information between the cluster labeling and the true labels, divided by 
the square root of the true label entropy times the clustering assignment entropy. Both purity and 
NMI lie between 0 and 1 - the closer they are to 1, the better the quality of the clustering. 
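Both metrics follow directly from their definitions. The sketch below is a plain Python illustration, using the square-root normalization for NMI described above; the function names are hypothetical, and cluster and class labels are assumed to be given as parallel lists.

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Fraction of examples that match the majority class label of their cluster."""
    per_cluster = {}
    for c, y in zip(clusters, labels):
        per_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in per_cluster.values()) / len(labels)

def label_entropy(assignment):
    """Entropy of the empirical distribution of a labeling."""
    n = len(assignment)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignment).values())

def nmi(clusters, labels):
    """Mutual information divided by sqrt(H(clusters) * H(labels))."""
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    pc, py = Counter(clusters), Counter(labels)
    mi = sum((c / n) * math.log(c * n / (pc[a] * py[b])) for (a, b), c in joint.items())
    denom = math.sqrt(label_entropy(clusters) * label_entropy(labels))
    return mi / denom if denom > 0 else 0.0
```

A clustering that matches the classes exactly scores 1 on both metrics, while a clustering that is independent of the classes has NMI equal to 0.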

Henceforth, we use Algorithm 1 with the fixed budget setting in our experiments to ensure a fair
comparison with the other methods, since they presume a fixed m. 


Comparison of CRAFT with DP-means extended to Feature Selection 

We now provide evidence that CRAFT outperforms DP-means on categorical data. We use the
Splice junction determination dataset that has all categorical features. We borrowed the feature
selection term from CRAFT to extend DP-means(R) to include feature selection, and retained its
squared Euclidean distance measure. Recall that, in a special case, the CRAFT objective degenerates
to DP-means(R) on numeric data when all features are retained, and cluster variances are all the same
(see the Supplementary). Fig. 5 shows the comparison results on the Splice data for different values
of $m$. CRAFT outperforms extended DP-means(R) in terms of both purity and NMI, showing the
importance of the entropy term in the context of clustering with feature selection.


Comparison with State-of-the-Art Unsupervised Feature Selection Methods 

We now demonstrate the benefits of cluster-specific feature selection accomplished by CRAFT. Table
1 and Table 2 show how CRAFT compares with two state-of-the-art unsupervised feature selection
methods, MCFS and NDFS, besides DP-means and DP-means(R) on several datasets,
namely Bank, Spam, Wine, Splice (described above), and Monk, when $m$ was set to 0.5 and
0.8 respectively. Our experiments highlight that CRAFT (a) works well for both numeric
and categorical data, and (b) compares favorably with both the global feature selection algorithms
and clustering methods, such as DP-means, that do not select features.
Finally, we found that besides clustering quality, CRAFT also performed well in terms of running
time. For instance, on the Spam dataset for m = 0.5, CRAFT required an average execution time
of only 0.39 seconds, compared to 1.78 and 61.41 seconds by MCFS and NDFS respectively. This
behavior can be attributed primarily to the benefits of the scalable K-means style algorithm employed
by CRAFT, as opposed to MCFS and NDFS that require computation-intensive spectral algorithms.

Conclusion 

CRAFT's framework incorporates cluster-specific feature selection and handles both categorical and
numeric data. It can be extended in several ways, some of which are discussed in Section 3. The
objective obtained from MAP asymptotics is interpretable, and informs simple algorithms for both
the fixed budget setting (the number of features selected per cluster is fixed) and the approximate
budget setting (the number of features selected per cluster is allowed to vary across the clusters).
Code for CRAFT is available at the following website: http://www.placeholder.com.
5 Supplementary Material


We now derive the various objectives for the CRAFT framework. We first show the derivation
for the generic objective that accomplishes feature selection on the assorted data. We then derive
the degenerate cases when all features are retained and all data are (a) numeric, and (b) binary
categorical. In particular, when the data are all numeric, we recover the DP-means objective.


5.1 Main Derivation: Clustering with Assorted Feature Selection 


We have the total number of features, $D = |\mathrm{Cat}| + |\mathrm{Num}|$. We define $S_{N,k}$ to be the number of
points assigned to cluster $k$. First, note that a Beta distribution with mean $c_1$ and variance $c_2$ has
shape parameters $\frac{c_1^2(1-c_1)}{c_2} - c_1$ and $\frac{c_1(1-c_1)^2}{c_2} + c_1 - 1$. Therefore, we can find the shape
parameters corresponding to $m$ and $\rho$. Now, recall that for numeric data, we assume the density is
of the following form:

$$f(x_{nd} \mid v_{kd}) = \frac{1}{Z_{kd}} \exp\left\{-\left[ v_{kd} \frac{(x_{nd} - \zeta_{kd})^2}{2\sigma_{kd}^2} + (1 - v_{kd}) \frac{(x_{nd} - \zeta_d)^2}{2\sigma_d^2} \right]\right\}, \tag{3}$$


where $Z_{kd}$ ensures that the area under the density is 1. Assuming an uninformative conjugate prior
on the (numeric) means, i.e. a Gaussian distribution with infinite variance, and using the Iverson
bracket notation for discrete (categorical) data, we obtain the joint distribution given in Fig. 6 for
the underlying graphical model shown in Fig. 1.


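$$
\begin{aligned}
P(x, z, v, \nu, \eta, \zeta, m) ={}& \prod_{k=1}^{K^+} \prod_{n: z_{n,k}=1} \left[ \left( \prod_{d \in \mathrm{Cat}: v_{kd}=1} \prod_{t \in T_d} \eta_{kdt}^{\mathbb{I}(x_{nd}=t)} \right) \left( \prod_{d \in \mathrm{Cat}: v_{kd}=0} \prod_{t \in T_d} \eta_{0dt}^{\mathbb{I}(x_{nd}=t)} \right) \left( \prod_{d' \in \mathrm{Num}} \frac{1}{Z_{kd'}} \exp\left\{-\left[ v_{kd'} \frac{(x_{nd'} - \zeta_{kd'})^2}{2\sigma_{kd'}^2} + (1 - v_{kd'}) \frac{(x_{nd'} - \zeta_{d'})^2}{2\sigma_{d'}^2} \right]\right\} \right) \right] \\
& \cdot \prod_{k=1}^{K^+} \prod_{d=1}^{D} \nu_{kd}^{v_{kd}} (1 - \nu_{kd})^{1 - v_{kd}}
\cdot \theta^{K^+ - 1} \frac{\Gamma(\theta + 1)}{\Gamma(\theta + N)} \prod_{k=1}^{K^+} (S_{N,k} - 1)! \\
& \cdot \prod_{k=1}^{K^+} \prod_{d \in \mathrm{Cat}} \frac{\Gamma\left(\sum_{t \in T_d} \frac{\alpha_{kdt}}{K^+}\right)}{\prod_{t \in T_d} \Gamma\left(\frac{\alpha_{kdt}}{K^+}\right)} \prod_{t' \in T_d} \eta_{kdt'}^{(\alpha_{kdt'}/K^+) - 1} \\
& \cdot \prod_{k=1}^{K^+} \prod_{d=1}^{D} \frac{\Gamma\left(\frac{m(1-m)}{\rho} - 1\right)}{\Gamma\left(\frac{m^2(1-m)}{\rho} - m\right) \Gamma\left(\frac{m(1-m)^2}{\rho} - (1-m)\right)} \, \nu_{kd}^{\frac{m^2(1-m)}{\rho} - m - 1} (1 - \nu_{kd})^{\frac{m(1-m)^2}{\rho} - (2-m)}.
\end{aligned}
\tag{4}
$$

Figure 6: Joint probability distribution for the generic case (both numeric and categorical features).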


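The total contribution of (3) to the negative joint log-likelihood is

$$\sum_{k=1}^{K^+} \sum_{d \in \mathrm{Num}} \sum_{n: z_{n,k}=1} \left[ v_{kd} \frac{(x_{nd} - \zeta_{kd})^2}{2\sigma_{kd}^2} + (1 - v_{kd}) \frac{(x_{nd} - \zeta_d)^2}{2\sigma_d^2} \right] + \sum_{k=1}^{K^+} \sum_{d \in \mathrm{Num}} S_{N,k} \log Z_{kd}.$$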
The contribution of the selected categorical features depends on the categorical means of the clusters,
and is given by

$$-\log \left( \prod_{k=1}^{K^+} \prod_{n: z_{n,k}=1} \prod_{d \in \mathrm{Cat}: v_{kd}=1} \prod_{t \in T_d} \eta_{kdt}^{\mathbb{I}(x_{nd}=t)} \right).$$

On the other hand, the categorical features not selected are assumed to be drawn from
cluster-independent global means, and therefore contribute

$$-\log \left( \prod_{k=1}^{K^+} \prod_{n: z_{n,k}=1} \prod_{d \in \mathrm{Cat}: v_{kd}=0} \prod_{t \in T_d} \eta_{0dt}^{\mathbb{I}(x_{nd}=t)} \right).$$

Thus, the total contribution of the categorical features is

$$-\sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \left[ \sum_{d \in \mathrm{Cat}: v_{kd}=1} \sum_{t \in T_d} \mathbb{I}(x_{nd} = t) \log \eta_{kdt} + \sum_{d \in \mathrm{Cat}: v_{kd}=0} \sum_{t \in T_d} \mathbb{I}(x_{nd} = t) \log \eta_{0dt} \right].$$


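The Bernoulli likelihood on $v_{kd}$ couples with the conjugate Beta prior on $\nu_{kd}$. To avoid having to
provide the value of $\nu_{kd}$ as a parameter, we take its point estimate to be the mean of the resulting
Beta posterior, i.e., we set

$$\nu_{kd} = \frac{\left(\frac{m^2(1-m)}{\rho} - m\right) + v_{kd}}{\frac{m(1-m)}{\rho}} = \frac{a_{kd}}{a_{kd} + b_{kd}}, \tag{5}$$

where

$$a_{kd} = \frac{m^2(1-m)}{\rho} - m + v_{kd}, \quad \text{and} \quad b_{kd} = \frac{m(1-m)^2}{\rho} + m - v_{kd}.$$

Then the contribution of the posterior to the negative log likelihood is

$$-\sum_{k=1}^{K^+} \sum_{d=1}^{D} \log \left[ \left( \frac{a_{kd}}{a_{kd} + b_{kd}} \right)^{a_{kd}} \left( \frac{b_{kd}}{a_{kd} + b_{kd}} \right)^{b_{kd}} \right], \tag{6}$$

or equivalently,

$$\sum_{k=1}^{K^+} \sum_{d=1}^{D} \underbrace{\left[ \log (a_{kd} + b_{kd})^{(a_{kd} + b_{kd})} - \log a_{kd}^{a_{kd}} - \log b_{kd}^{b_{kd}} \right]}_{F(v_{kd})}.$$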


Since $v_{kd} \in \{0,1\}$, this simplifies to

$$\sum_{k=1}^{K^+} \sum_{d=1}^{D} F(v_{kd}) = \sum_{k=1}^{K^+} \sum_{d=1}^{D} \left[ v_{kd}(F(1) - F(0)) + F(0) \right] = \left( \sum_{k=1}^{K^+} \sum_{d=1}^{D} v_{kd} \right) \Delta F + K^+ D F(0), \tag{7}$$

where $\Delta F = F(1) - F(0)$ quantifies the change when a feature is selected for a cluster.


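The numeric means do not make any contribution since we assumed an uninformative conjugate
prior over $\mathbb{R}$. On the other hand, the categorical means contribute

$$-\log \left[ \prod_{k=1}^{K^+} \prod_{d \in \mathrm{Cat}} \frac{\Gamma\left(\sum_{t \in T_d} \frac{\alpha_{kdt}}{K^+}\right)}{\prod_{t \in T_d} \Gamma\left(\frac{\alpha_{kdt}}{K^+}\right)} \prod_{t' \in T_d} \eta_{kdt'}^{(\alpha_{kdt'}/K^+) - 1} \right],$$

which simplifies to

$$\sum_{k=1}^{K^+} \sum_{d \in \mathrm{Cat}} \left[ -\log \frac{\Gamma\left(\sum_{t \in T_d} \frac{\alpha_{kdt}}{K^+}\right)}{\prod_{t \in T_d} \Gamma\left(\frac{\alpha_{kdt}}{K^+}\right)} - \sum_{t' \in T_d} \left( \frac{\alpha_{kdt'}}{K^+} - 1 \right) \log \eta_{kdt'} \right]. \tag{8}$$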


Finally, the Dirichlet process specifies a distribution over possible clusterings, while favoring
assignments of points to a small number of clusters. The contribution of the corresponding term is

$$-\log \left[ \theta^{K^+ - 1} \frac{\Gamma(\theta + 1)}{\Gamma(\theta + N)} \prod_{k=1}^{K^+} (S_{N,k} - 1)! \right],$$

or equivalently,

$$-(K^+ - 1) \log \theta - \log \left[ \frac{\Gamma(\theta + 1)}{\Gamma(\theta + N)} \prod_{k=1}^{K^+} (S_{N,k} - 1)! \right]. \tag{9}$$


The total negative log-likelihood is just the sum of the data-likelihood term and the terms in (7), (8), and (9). We want to
maximize the joint likelihood, or equivalently, minimize the total negative log-likelihood. We
use asymptotics to simplify our objective. In particular, letting $\sigma_d \to \infty$, $\forall k \in [K^+]$ and $d \in \mathrm{Num}$,
and $\alpha_{kdt} \to K^+$, $\forall t \in T_d$, $d \in \mathrm{Cat}$, $k \in [K^+]$, and setting $\log \theta$ to


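$$-\left( \lambda + \frac{\sum_{k=1}^{K^+} \sum_{d \in \mathrm{Cat}} \log |T_d| - \sum_{k=1}^{K^+} \sum_{d \in \mathrm{Num}} \log \sigma_d}{K^+ - 1} \right),$$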


we obtain our objective for assorted feature selection: 




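$$\begin{aligned}
\underset{z, v, \eta, \zeta, \sigma}{\arg\min} \;
& \underbrace{\sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d \in \mathrm{Cat}} \left[ -v_{kd} \sum_{t \in T_d} \mathbb{I}(x_{nd} = t) \log \eta_{kdt} - (1 - v_{kd}) \sum_{t \in T_d} \mathbb{I}(x_{nd} = t) \log \eta_{0dt} \right]}_{\text{Categorical Data Discrepancy}} \\
& + \underbrace{\sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d \in \mathrm{Num}} \frac{v_{kd} (x_{nd} - \zeta_{kd})^2}{2 \sigma_{kd}^2}}_{\text{Numeric Data Discrepancy}}
+ \underbrace{(\lambda + D F_0) K^+}_{\text{Regularization Term}}
+ \underbrace{\left( \sum_{k=1}^{K^+} \sum_{d=1}^{D} v_{kd} \right) F_\Delta}_{\text{Feature Control}},
\end{aligned}$$

where $\Delta F = F(1) - F(0)$ quantifies the change when a feature is selected for a cluster, and we
have renamed the constants $F(0)$ and $\Delta F$ as $F_0$ and $F_\Delta$ respectively.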
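The four labelled terms of the objective can be evaluated directly for a fixed assignment. The sketch below is only an illustration under assumed data structures: the function name `craft_objective`, the argument layout (nested lists and dictionaries), and all example values are hypothetical and are not taken from the paper.

```python
import math

def craft_objective(X, z, v, cat, num, eta, eta0, zeta, sigma, lam, F0, F_delta):
    """Evaluate the assorted feature selection objective for fixed variables.

    X: list of points, each a list of D feature values.
    z: cluster index of each point; v: K x D matrix of 0/1 feature indicators.
    cat, num: lists of categorical and numeric feature indices.
    eta[k][d][t], eta0[d][t]: cluster-specific and global category probabilities.
    zeta[k][d], sigma[k][d]: cluster-specific numeric means and standard deviations.
    """
    K, D = len(v), len(cat) + len(num)
    categorical = numeric = 0.0
    for x, k in zip(X, z):
        for d in cat:
            # The indicator I(x_nd = t) keeps only the observed category t = x[d].
            p = eta[k][d][x[d]] if v[k][d] else eta0[d][x[d]]
            categorical -= math.log(p)
        for d in num:
            if v[k][d]:
                numeric += (x[d] - zeta[k][d]) ** 2 / (2.0 * sigma[k][d] ** 2)
    regularization = (lam + D * F0) * K
    feature_control = sum(map(sum, v)) * F_delta
    return categorical + numeric + regularization + feature_control

# Tiny illustrative example: one cluster, one categorical and one numeric feature.
value = craft_objective(
    X=[["a", 1.0], ["b", 3.0]], z=[0, 0], v=[[1, 1]], cat=[0], num=[1],
    eta=[{0: {"a": 0.5, "b": 0.5}}], eta0={0: {"a": 0.5, "b": 0.5}},
    zeta=[{1: 2.0}], sigma=[{1: 1.0}], lam=1.0, F0=0.0, F_delta=0.5,
)
```

In this example the categorical term is $2 \log 2$, the numeric term is 1, the regularization term is 1, and the feature control term is 1.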


5.1.1 Setting ρ


Reproducing the equation for $\nu_{kd}$ from above, since we want to ensure that $\nu_{kd} \in (0, 1)$, we must have

$$0 < \frac{\left( \frac{m^2(1-m)}{\rho} - m \right) + v_{kd}}{\frac{m(1-m)}{\rho}} < 1.$$

Since $v_{kd} \in \{0,1\}$, this immediately constrains

$$\rho \in (0, m(1 - m)).$$

Note that $\rho$ guides the selection of features: a high value of $\rho$, close to $m(1-m)$, enables local feature
selection ($v_{kd}$ becomes important), whereas a low value of $\rho$, close to 0, reduces the influence of $v_{kd}$
considerably, thereby resulting in global selection.
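The effect of $\rho$ can be seen by evaluating the point estimate at the two ends of the admissible interval. Rearranging the expression above gives $\nu_{kd} = m + (v_{kd} - m)\,\rho / (m(1-m))$, so $\nu_{kd}$ tends to $m$ as $\rho \to 0$ and to $v_{kd}$ as $\rho \to m(1-m)$. The sketch below assumes a hypothetical helper `nu` and illustrative values of `m` and `rho`.

```python
def nu(m, rho, v):
    """Posterior-mean estimate of nu_kd for a feature indicator v in {0, 1}."""
    if not 0.0 < rho < m * (1.0 - m):
        raise ValueError("rho must lie in (0, m(1 - m))")
    return (m * m * (1.0 - m) / rho - m + v) / (m * (1.0 - m) / rho)

m = 0.3  # illustrative prior mean; the admissible range is rho in (0, 0.21)

# rho close to 0: nu stays near m whatever the indicator is (global selection).
low = [nu(m, 1e-6, v) for v in (0, 1)]

# rho close to m(1 - m): nu follows the indicator (local selection).
high = [nu(m, 0.2099, v) for v in (0, 1)]
```

With these values, both entries of `low` stay within about $10^{-5}$ of 0.3, while the entries of `high` are close to 0 and 1 respectively.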
5.2 Degenerate Case: Clustering Binary Categorical Data without Feature Selection


In this case, the discrete distribution degenerates to Bernoulli, while the numeric discrepancy and the
feature control terms do not arise. Therefore, we can replace the Iverson bracket notation by having
cluster means $\mu$ drawn from Bernoulli distributions. Then, the joint distribution of the observed data
$x$, cluster indicators $z$ and cluster means $\mu$ is given by

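$$\begin{aligned}
P(x, z, \mu) &= P(x \mid z, \mu) \, P(z) \, P(\mu) \\
&= \underbrace{\prod_{k=1}^{K^+} \prod_{n: z_{n,k}=1} \prod_{d=1}^{D} \mu_{kd}^{x_{nd}} (1 - \mu_{kd})^{1 - x_{nd}}}_{(A)}
\cdot \underbrace{\theta^{K^+ - 1} \frac{\Gamma(\theta + 1)}{\Gamma(\theta + N)} \prod_{k=1}^{K^+} (S_{N,k} - 1)!}_{(B)} \\
&\quad \cdot \underbrace{\prod_{k=1}^{K^+} \prod_{d=1}^{D} \frac{\Gamma(2\alpha / K^+)}{\Gamma(\alpha / K^+)^2} \, \mu_{kd}^{\alpha/K^+ - 1} (1 - \mu_{kd})^{\alpha/K^+ - 1}}_{(C)}.
\end{aligned}$$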

The joint negative log-likelihood is

$$-\log P(x, z, \mu) = -[\log(A) + \log(B) + \log(C)].$$

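We first note that

$$\begin{aligned}
\log(A) &= \sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d=1}^{D} \left[ x_{nd} \log \mu_{kd} + (1 - x_{nd}) \log(1 - \mu_{kd}) \right] \\
&= \sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d=1}^{D} \left[ x_{nd} \log \frac{\mu_{kd}}{1 - \mu_{kd}} + \log(1 - \mu_{kd}) \right] \\
&= \sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d=1}^{D} \left[ (x_{nd} - \mu_{kd}) \log \frac{\mu_{kd}}{1 - \mu_{kd}} + \mu_{kd} \log \frac{\mu_{kd}}{1 - \mu_{kd}} + \log(1 - \mu_{kd}) \right] \\
&= \sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d=1}^{D} \left[ (x_{nd} - \mu_{kd}) \log \frac{\mu_{kd}}{1 - \mu_{kd}} + \mu_{kd} \log \mu_{kd} + (1 - \mu_{kd}) \log(1 - \mu_{kd}) \right] \\
&= \sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d=1}^{D} \left[ (x_{nd} - \mu_{kd}) \log \frac{\mu_{kd}}{1 - \mu_{kd}} - \mathbb{H}(\mu_{kd}) \right],
\end{aligned}$$

where $\mathbb{H}(p) = -p \log p - (1 - p) \log(1 - p)$ for $p \in [0,1]$.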
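The chain of equalities for $\log(A)$ holds term by term, which is easy to confirm numerically for a single pair $(x_{nd}, \mu_{kd})$. This is a small sketch with hypothetical function names; it assumes $\mu_{kd}$ lies strictly inside $(0, 1)$ so that all logarithms are finite.

```python
import math

def bernoulli_log_lik(x, mu):
    """First line of the derivation: x log(mu) + (1 - x) log(1 - mu)."""
    return x * math.log(mu) + (1 - x) * math.log(1.0 - mu)

def entropy_form(x, mu):
    """Last line of the derivation: (x - mu) log(mu / (1 - mu)) - H(mu)."""
    h = -mu * math.log(mu) - (1.0 - mu) * math.log(1.0 - mu)
    return (x - mu) * math.log(mu / (1.0 - mu)) - h

# Largest discrepancy between the two forms over a small grid of values.
max_gap = max(
    abs(bernoulli_log_lik(x, mu) - entropy_form(x, mu))
    for x in (0, 1)
    for mu in (0.1, 0.25, 0.5, 0.9)
)
```

The gap is at the level of floating-point rounding, so the entropy form is just the Bernoulli log-likelihood rewritten.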


$\log(B)$ and $\log(C)$ can be computed via steps analogous to those used in assorted feature selection.
Invoking the asymptotics by letting $\alpha \to K^+$, and setting


= e 


K+D 


- A-r 


log 


K+-1 "\K+ 


we obtain the following objective: 

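$$\underset{z, \mu}{\arg\min} \; \underbrace{\sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d} \left[ \mathbb{H}(\mu_{kd}) + (\mu_{kd} - x_{nd}) \log \frac{\mu_{kd}}{1 - \mu_{kd}} \right]}_{\text{(Binary Discrepancy)}} + \lambda K^+, \tag{10}$$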
where the term (Binary Discrepancy) is an objective for binary categorical data, similar to the K-means
objective for numeric data. This suggests a very intuitive procedure, which is outlined in
Algorithm 2.


Algorithm 2 Clustering binary categorical data

Input: $x_1, \ldots, x_N \in \{0,1\}^D$: binary categorical data, and $\lambda > 0$: cluster penalty parameter.
Output: $K^+$: number of clusters and $l_1, \ldots, l_{K^+}$: clustering.

1. Initialize $K^+ = 1$, $l_1 = \{x_1, \ldots, x_N\}$ and the mean $\mu_1$ (sample randomly from the
dataset).

2. Initialize cluster indicators $z_n = 1$ for all $n \in [N]$.

3. Repeat until convergence

• Compute $\forall k \in [K^+], d \in [D]$:

$$\mathbb{H}(\mu_{kd}) = -\mu_{kd} \log \mu_{kd} - (1 - \mu_{kd}) \log(1 - \mu_{kd}).$$

• For each point $x_n$

- Compute the following for all $k \in [K^+]$:

$$d_{nk} = \sum_{d=1}^{D} \left[ \mathbb{H}(\mu_{kd}) + (\mu_{kd} - x_{nd}) \log \frac{\mu_{kd}}{1 - \mu_{kd}} \right].$$

- If $\min_k d_{nk} > \lambda$, set $K^+ = K^+ + 1$, $z_n = K^+$, and $\mu_{K^+} = x_n$.

- Otherwise, set $z_n = \arg\min_k d_{nk}$.

• Generate clusters $l_1, \ldots, l_{K^+}$ based on $z_1, \ldots, z_N$: $l_k = \{x_n \mid z_n = k\}$.

• For each cluster $l_k$, update $\mu_k = \frac{1}{|l_k|} \sum_{x \in l_k} x$.


In each iteration, the algorithm computes the “distance” from each point to each existing cluster
center, and checks whether the minimum distance is within $\lambda$. If yes, the point is assigned
to the nearest cluster; otherwise a new cluster is started with the point as its cluster center. The
cluster means are updated at the end of each iteration, and the steps are repeated until there is no
change in cluster assignments over successive iterations.
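The procedure described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `cluster_binary`, the `max_iter` and `seed` arguments, the clipping constant `EPS` (needed because a mean equal to 0 or 1 makes the logarithms infinite), and the removal of clusters that become empty are all assumptions added here.

```python
import math
import random

EPS = 1e-6  # assumption: clip means away from 0 and 1 so the logs stay finite

def _distance(x, mu):
    """d_nk = sum_d [ H(mu_kd) + (mu_kd - x_nd) * log(mu_kd / (1 - mu_kd)) ]."""
    total = 0.0
    for x_d, m in zip(x, mu):
        m = min(max(m, EPS), 1.0 - EPS)
        entropy = -m * math.log(m) - (1.0 - m) * math.log(1.0 - m)
        total += entropy + (m - x_d) * math.log(m / (1.0 - m))
    return total

def cluster_binary(X, lam, max_iter=100, seed=0):
    """Cluster binary vectors X with cluster penalty lam; returns (labels, means)."""
    rng = random.Random(seed)
    means = [list(rng.choice(X))]  # mean of the single initial cluster
    z = [0] * len(X)               # every point starts in cluster 0
    for _ in range(max_iter):
        old_z = list(z)
        for n, x in enumerate(X):
            dists = [_distance(x, mu) for mu in means]
            k = min(range(len(means)), key=dists.__getitem__)
            if dists[k] > lam:
                means.append(list(x))  # open a new cluster centred at x
                k = len(means) - 1
            z[n] = k
        # Assumption: drop clusters left empty, relabel, then update the means.
        labels = sorted(set(z))
        z = [labels.index(k) for k in z]
        means = []
        for k in range(len(labels)):
            members = [x for x, zk in zip(X, z) if zk == k]
            means.append([sum(col) / len(members) for col in zip(*members)])
        if z == old_z:
            break
    return z, means

# Two well-separated groups of binary vectors (illustrative data).
X = [[1, 1, 1, 0, 0, 0]] * 3 + [[0, 0, 0, 1, 1, 1]] * 3
labels, means = cluster_binary(X, lam=3.0)
```

With a moderate penalty the two groups end up in separate clusters, while a very large `lam` keeps every point in a single cluster.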

We get a more intuitively appealing objective by noting that the objective (10) can be equivalently
written as

$$\underset{z}{\arg\min} \; \sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d} \mathbb{H}(\mu^*_{kd}) + \lambda K^+, \tag{11}$$

where $\mu^*_{kd}$ denotes the mean of feature $d$ computed by using points belonging to cluster $k$, and
$\mathbb{H}$ characterizes the uncertainty. Thus the objective tries to minimize the overall uncertainty across clusters
and thus forces similar points to come together. The regularization term ensures that the points do
not form too many clusters, since in the absence of the regularizer each point will form a singleton
cluster thereby leading to a trivial clustering.
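The step from (10) to (11) relies on the fact that, at the cluster mean, the deviations $\mu_{kd} - x_{nd}$ sum to zero over the points of a cluster, so only the entropy term survives. A minimal numerical check for one cluster follows; the function names and the example data are illustrative assumptions.

```python
import math

def entropy(p):
    """Binary entropy H(p) with the convention 0 log 0 = 0."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def cluster_mean(cluster):
    """Per-feature mean of a list of binary vectors."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def discrepancy_10(cluster, mu):
    """Binary discrepancy of (10) for one cluster and a mean vector mu in (0, 1)^D."""
    return sum(
        entropy(m) + (m - x) * math.log(m / (1.0 - m))
        for point in cluster
        for x, m in zip(point, mu)
    )

def discrepancy_11(cluster):
    """Entropy form of (11) for one cluster, evaluated at the cluster mean."""
    return len(cluster) * sum(entropy(m) for m in cluster_mean(cluster))

cluster = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [1, 0, 0]]  # illustrative data
at_mean = discrepancy_10(cluster, cluster_mean(cluster))
entropy_only = discrepancy_11(cluster)
off_mean = discrepancy_10(cluster, [0.6, 0.4, 0.5])
```

Here `at_mean` and `entropy_only` coincide up to rounding, and any other mean vector, such as the one used for `off_mean`, gives a larger discrepancy.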


5.3 Degenerate Case: Clustering Numerical Data without Feature Selection 

In this case, there are no categorical terms. Furthermore, assuming an uninformative conjugate prior 
on the numeric means, the terms that contribute to the negative joint log-likelihood are 


—Cfcd'—Cd' 

fc = l d' 


and 
$$\theta^{K^+ - 1} \frac{\Gamma(\theta + 1)}{\Gamma(\theta + N)} \prod_{k=1}^{K^+} (S_{N,k} - 1)!.$$
Taking the negative logarithms on both these terms and adding them up, setting $\log \theta$ to


/ K+ \ 

^ ^ log Zkd 

k=l d' 


A + 


K+ - 1 


V 




and $v_{kd'} = 1$ (since all features are retained), and letting $\sigma_{d'} \to \infty$ for all $d'$, we obtain


$$\underset{z}{\arg\min} \; \sum_{k=1}^{K^+} \sum_{n: z_{n,k}=1} \sum_{d} \frac{(x_{nd} - \zeta^*_{kd})^2}{2 \sigma^{*2}_{kd}} + \lambda K^+, \tag{12}$$

where $\zeta^*_{kd}$ and $\sigma^{*2}_{kd}$ are, respectively, the mean and variance of the feature $d$ computed using all the
points assigned to cluster $k$. This degenerates to the DP-means objective of Kulis and Jordan when $\sigma^*_{kd} = 1/\sqrt{2}$, for
all $k$ and $d$. Thus, using a completely different model and analysis to that of Kulis and Jordan, we recover the DP-means
objective as a special case.
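The reduction to DP-means can be illustrated by evaluating (12) with every $\sigma^*_{kd}$ fixed to $1/\sqrt{2}$ and comparing it with the sum of squared Euclidean distances to the cluster centroids plus $\lambda K^+$. In this sketch the function names are hypothetical, and a single shared `sigma` is used for all clusters and features as a simplifying assumption.

```python
import math

def centroid(cluster):
    """Per-feature mean of a list of numeric vectors."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def numeric_objective(clusters, lam, sigma=1.0 / math.sqrt(2.0)):
    """Objective (12) with one shared standard deviation sigma (an assumption)."""
    total = 0.0
    for cluster in clusters:
        means = centroid(cluster)
        for point in cluster:
            total += sum((x - c) ** 2 / (2.0 * sigma ** 2) for x, c in zip(point, means))
    return total + lam * len(clusters)

def dp_means_objective(clusters, lam):
    """Sum of squared Euclidean distances to centroids plus lam per cluster."""
    total = 0.0
    for cluster in clusters:
        means = centroid(cluster)
        for point in cluster:
            total += sum((x - c) ** 2 for x, c in zip(point, means))
    return total + lam * len(clusters)

# Illustrative clustering of three 2-D points into two clusters.
clusters = [[[0.0, 0.0], [2.0, 0.0]], [[5.0, 5.0]]]
value_12 = numeric_objective(clusters, lam=1.5)
value_dp = dp_means_objective(clusters, lam=1.5)
```

For this example both objectives evaluate to 5 (a within-cluster cost of 2 plus a penalty of 1.5 for each of the two clusters).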